Executive Summary

This report analyzes a sample of the claims data collected by the auto-insurance provider Indian Money, Bangalore, India. The purpose of the analysis is to identify the factors affecting the profitability of the company. The insurance industry in India is a $17 billion industry, and it is imperative to target the right audience with the right deals to keep up with the competition and sustain profits. The two factors that determine profit for an insurance company are premiums and claim amounts: Revenue = Premium – Claim Amount (keeping operational cost constant).

Based on the research question (how demographics such as age group, gender and region affect the profitability of the company) and an initial analysis of the dataset, three hypotheses were formulated.

  1. The profitability for Indian Money is higher for female drivers in the North region. Explanation: We assumed that female drivers are more responsible than male drivers and that the North region is the most affluent region; we therefore expected a strong correlation between female drivers in the North region and profit.

  2. The profitability for Indian Money is higher for drivers above 40 years of age. Explanation: Experienced drivers drive more cautiously than young and inexperienced drivers.

  3. The profitability of Indian Money is higher for vehicles with cubic capacity of more than 2200. Explanation: Vehicles with higher cubic capacity have higher premiums, and their owners drive cautiously.

The revenue is calculated from premium and claim amount. Profit is calculated as a proportion of revenue. The dependent variable is profit; the independent variables considered are age group, gender, IDV (Insured Declared Value), year of manufacture and region (state, city), which can act as probable indicators of the profitability of Indian Money.

Statistical measures and visualization techniques are used to analyze the variables at both the univariate and the bivariate level. Transformations were tried, and outliers were removed to obtain a more normal distribution and a more reliable linear model. If a categorical variable contained too few observations in some categories, multiple categories were clubbed together and a new derived factor variable was formed.

Data Introduction

The dataset is a sample of data maintained by an auto-insurance provider named Indian Money, Bangalore, India. We received the data from a colleague working at this company. The data is collected by the organization during claims processing and the reporting of claim data at the end of the year. The data contains 7702 observations and 15 variables.

##   Policy.Number Year    IDV      City       State Cubic.Capacity
## 1       3214946 2009 650119 bangalore   karnataka           2200
## 2       3215858 2010 745688  kolkatta west bengal           1500
## 3       3216013 2007 236971  newdelhi         ncr           1200
## 4       3216152 2011 791024   gurgaon     haryana           1500
## 5       3216372 2009 259162 bangalore   karnataka           1000
##        Mfr.Model Premium      Type Gender Channel   Age    Cover.Type
## 1    Tata Safari   17427   Renewal   Male  Broker 55-64 Comprehensive
## 2     Honda City   22305 From Comp Female  Broker 18-24 Comprehensive
## 3   Maruti Swift    6573 From Comp Female  Direct 25-34 Comprehensive
## 4     Honda City   23261 From Comp   Male  Direct 45-54 Comprehensive
## 5 Maruti Wagon-R    7382 From Comp   Male  Direct 55-64 Comprehensive
##   PaymentFrequency ClaimsInd Claim.Amount  Zone     Vehicle_Cat Revenue
## 1           Annual         0            0 South  CC-large sized   17427
## 2          Monthly         0            0  East CC-medium sized   22305
## 3          Monthly         0            0 North  CC-small sized    6573
## 4           Annual         0            0 North CC-medium sized   23261
## 5          Monthly         1        80694 South  CC-small sized  -73312

Research Question

The auto insurance industry in India is estimated to reach a value of $17 billion in 2025. The industry is growing rapidly and reached $15 billion in 2017. Companies compete with each other for premiums and try to reduce the claims filed by customers. Generating revenue and making a good profit is the key to success and to staying ahead of the competition. We received a sample containing 7702 observations collected by the company during claims processing over 7 years (2005 to 2011). We have tried to identify the factors that affect the profitability of the company.

We have focused our efforts on analyzing the factors that impact the profitability of the company. We finalized the following research questions:

  1. What factors affect the profitability of Indian Money Insurance company?
  2. Is there an impact of geography when it comes to the profitability of auto insurance?
  3. Does Age Group or Gender affect the profitability of auto insurance?

Our Hypotheses

  1. The profitability of Indian Money is higher for female drivers in North zone.

  2. The profitability of Indian Money is higher for drivers above 40 years of age.

  3. The profitability of Indian Money is higher for vehicles with cubic capacity more than 2200.

Variables used for Analysis

Name              Data Type  Variable  Description
Year              Factor     CV        7 years of data
IDV               Numeric    IV        Insured Declared Value of the car
Gender            Factor     IV        Male or Female
Zone              Factor     IV        Regions of the country
Age Group         Factor     IV        Age groups of applicants
ClaimsInd         Factor     IV        Claim taken (0 = not taken, 1 = taken)
Vehicle Category  Factor     IV        Grouped by cubic capacity
Revenue           Numeric    DV        Derived from premium and claim data (Premium - Claim)
Profit            Numeric    DV        Derived column (Revenue %): Revenue / Max(Revenue) * 100

Dataset has 15 variables and 7702 observations.
Dependent Variable: Premium and Claim Amount.
Independent variables: Age Group, IDV and Gender.
Derived variables: Zone, Vehicle Category, Revenue and Profit.

Independent Variables

IDV: This is the Insured Declared Value of the car, i.e. the valuation of the car according to the rules of the insurance company.
Age Group: The original data had various conflicting age groups, which were re-levelled. The final age groups are 18-24, 25-34, 35-44, 45-54, 55-64 and 65+.
Gender: Male and Female.

Dependent Variables

Premium: This is the total premium paid for the policy by a customer at the beginning of the year.
Claim Amount: This is the amount claimed by the customer.

Derived Variables

Zone:
• Derived from state variable.
• Divided into four zones – North, South, East, West.
• Condition:
o North:
o South:
o East:
o West:
• Reason: Very few observations in most states.
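The state-to-zone conditions are not listed above, so the full mapping is not reproduced here. As a minimal sketch, the lookup below covers only the four states visible in the five sample rows shown earlier; the name `zone_lookup` is hypothetical and the real mapping covers more states.

```r
# Partial, illustrative state-to-zone lookup.
# Only the states that appear in the sample rows are included (assumption:
# the full mapping used in the analysis follows the same pattern).
zone_lookup <- c("karnataka"   = "South",
                 "west bengal" = "East",
                 "ncr"         = "North",
                 "haryana"     = "North")

# States of the five sample rows
state <- c("karnataka", "west bengal", "ncr", "haryana", "karnataka")

# Derive the zone as a factor with the four documented levels
zone <- factor(unname(zone_lookup[state]),
               levels = c("East", "North", "South", "West"))
print(table(zone))
```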

Vehicle Category:
• Derived from Cubic Capacity variable.
• Divided into three vehicle categories - CC-large sized, CC-medium sized, CC-small sized.
• Condition:
o CC-large sized: cubic capacity >1800
o CC-medium sized: cubic capacity >1250 and <1800
o CC-small sized: cubic capacity <1250
• Reason: Very few observations in certain categories of cubic capacity.
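A minimal sketch of how this binning could be done with `cut()`. The conditions above do not say where cubic capacities of exactly 1250 or 1800 belong, so this sketch assumes they fall into the lower category; the helper name `vehicle_category` is hypothetical.

```r
# Hypothetical helper: bin cubic capacity into the three vehicle categories.
# Assumption: boundary values go to the lower bin (1250 -> small, 1800 -> medium).
vehicle_category <- function(cubic_capacity) {
  cut(cubic_capacity,
      breaks = c(-Inf, 1250, 1800, Inf),
      labels = c("CC-small sized", "CC-medium sized", "CC-large sized"),
      right = TRUE)
}

# Cubic capacities of the five sample rows shown earlier
cc <- c(2200, 1500, 1200, 1500, 1000)
vehicle_cat <- vehicle_category(cc)
print(table(vehicle_cat))
```

On the sample rows this reproduces the Vehicle_Cat values shown in the data preview.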

Revenue:
• Derived from premium and claim amount.
• Revenue = Premium – Claim Amount.
• Reason: Revenue generated per person.

Profit:
• Derived from revenue.
• Profit = Revenue/ Max(Revenue) * 100.
• Reason: Profit as a proportion of revenue.
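The two derivations can be sketched on the five sample rows shown earlier. Note that the stated profit formula yields a negative value whenever revenue is negative, while the summary statistics for profit report a minimum of 0 and a maximum of 100. A min-max rescaling would produce that range; it is included below purely as an assumption about what may have been computed, not as the confirmed method.

```r
# Premium and claim amount of the five sample rows
premium      <- c(17427, 22305, 6573, 23261, 7382)
claim_amount <- c(0, 0, 0, 0, 80694)

# Revenue = Premium - Claim Amount
revenue <- premium - claim_amount

# Profit as stated in the text: Revenue / Max(Revenue) * 100
profit_stated <- revenue / max(revenue) * 100

# Alternative (assumption): min-max rescaling to the range 0..100
profit_minmax <- (revenue - min(revenue)) / (max(revenue) - min(revenue)) * 100

print(data.frame(revenue,
                 profit_stated = round(profit_stated, 2),
                 profit_minmax = round(profit_minmax, 2)))
```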


Univariate Analysis

Numeric Variable Analysis

PROFIT

##    vars    n  mean   sd median trimmed mad min max range  skew kurtosis
## X1    1 7702 80.45 6.32  81.95   81.51 1.3   0 100   100 -3.14    18.23
##      se
## X1 0.07

Analysis: The profit data is not normally distributed; it is negatively skewed (skew = -3.14). The boxplot shows many outliers.

IDV

##    vars    n     mean       sd median  trimmed      mad    min     max
## X1    1 7702 385617.3 246733.8 305093 343581.4 136637.9 111822 1790603
##      range skew kurtosis      se
## X1 1678781 2.08     5.63 2811.43

Analysis: The IDV variable is right-skewed because of vehicles with a high insured declared value. The boxplot shows many outliers.

Factors Variable Analysis

ZONE

## 
##  East North South  West 
##   935  4163  1728   876
## 
##     East    North    South     West 
## 12.13970 54.05090 22.43573 11.37367

Analysis: Zone is a categorical variable derived from the State variable in the dataset. The North zone has the most observations (about 54%).
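The percentage table follows directly from the counts. A minimal sketch using the zone counts reported above (with the full data frame, `table()` and `prop.table()` would typically be used instead; the vector name `zone_counts` is illustrative):

```r
# Zone counts as reported in the frequency table
zone_counts <- c(East = 935, North = 4163, South = 1728, West = 876)

# Share of each zone in percent
zone_pct <- round(100 * zone_counts / sum(zone_counts), 2)
print(zone_pct)
```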

Age

## 
## 18-24 25-34 35-44 45-54 55-64   65+ 
##  1145  1500  2089  1654   782   532
## 
##     18-24     25-34     35-44     45-54     55-64       65+ 
## 14.866269 19.475461 27.122825 21.474942 10.153207  6.907297

Analysis: Age group is a categorical variable which groups drivers into 6 categories. The 35-44 age group has more observations than the other age groups.

Gender

## 
## Female   Male 
##   2134   5568
## 
##   Female     Male 
## 27.70709 72.29291

Analysis: Gender is a categorical variable. Dataset has more male drivers than female drivers.

Vehicle_cat

## 
##  CC-large sized CC-medium sized  CC-small sized 
##            1023            2571            4108
## 
##  CC-large sized CC-medium sized  CC-small sized 
##        13.28226        33.38094        53.33680

Analysis: Vehicle category is a variable derived from the cubic capacity variable in the dataset. It groups vehicles into 3 categories. Small-sized vehicles account for most of the observations.

Year

## 
## 2005 2006 2007 2008 2009 2010 2011 
##  272  693  870  988 1154 1980 1745
## 
##       2005       2006       2007       2008       2009       2010 
## 0.03531550 0.08997663 0.11295767 0.12827837 0.14983121 0.25707608 
##       2011 
## 0.22656453

ClaimsInd

## 
##    0    1 
## 5825 1877

Bivariate Analysis

HANDLING OUTLIERS

Creating a function, outliers, to retrieve extreme outliers, i.e. values more than 3 times the IQR above the upper quartile or below the lower quartile.

outliers <- function(column) {
  # quantile() returns min, Q1, median, Q3, max by default
  lowerq <- as.vector(quantile(column)[2]) # returns 1st quartile
  upperq <- as.vector(quantile(column)[4]) # returns 3rd quartile
  iqr <- upperq - lowerq
  # Fences for extreme outliers: 3 * IQR beyond the quartiles
  extreme.outliers.upper <- upperq + (iqr * 3)
  extreme.outliers.lower <- lowerq - (iqr * 3)
  # Indices of the observations outside the fences
  extreme.outliers <- which(column > extreme.outliers.upper
                            | column < extreme.outliers.lower)
  print(paste("Extreme outlier:", extreme.outliers))
  return(extreme.outliers)
}
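A usage sketch on toy data, showing how the returned indices can be used to drop rows. The compact helper `extreme_outlier_idx` below is a hypothetical restatement of the same 3 * IQR rule so that the example runs on its own; the toy data frame is made up for illustration.

```r
# Compact restatement of the 3 * IQR rule (hypothetical helper name)
extreme_outlier_idx <- function(x) {
  q <- quantile(x, probs = c(0.25, 0.75), names = FALSE)
  iqr <- q[2] - q[1]
  which(x > q[2] + 3 * iqr | x < q[1] - 3 * iqr)
}

# Toy data: 20 ordinary values and one extreme value
toy <- data.frame(id = 1:21, value = c(1:20, 100))

idx <- extreme_outlier_idx(toy$value)

# Guard against an empty index: toy[-integer(0), ] would drop every row
toy_clean <- if (length(idx) > 0) toy[-idx, ] else toy

print(idx)
print(nrow(toy_clean))
```

The guard matters because negative indexing with an empty vector selects no rows at all, which would silently empty the data frame when no outliers are found.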

Analysis of two numeric variables, i.e. profit vs IDV

## [1] 0.2067221

Analysis: The correlation between profit and IDV in the original data is weak (about 0.21). The univariate analysis showed that IDV is highly skewed, so we remove all extreme outliers from the IDV variable.

The univariate analysis also showed that profit is highly skewed, so we remove all extreme outliers from the profit variable as well.

## [1] 0.7021029
  • Much better positive correlation

Percentage improvement in the correlation value after handling outliers in both numeric variables:

## [1] 239.6361
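The improvement figure is the relative change between the two correlation values reported above. A short sketch of the arithmetic (variable names are illustrative):

```r
# Correlation between profit and IDV before and after outlier removal
r_before <- 0.2067221
r_after  <- 0.7021029

# Relative improvement in percent
improvement_pct <- (r_after - r_before) / r_before * 100
print(round(improvement_pct, 2))
```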

Analysis of two factor variables

Comparing claims with gender, age, zone, revenue, IDV and vehicle category.

ClaimsInd and gender

Analysis: We can see from the plot that female drivers claim slightly more than male drivers.

Claimed vs age

Analysis: From the plot we can see that the 35-44 age group claims slightly more than the other age groups.

Claimed vs Zone

Analysis: From the plot, we can see that the North zone claims slightly more than the other zones.

Claimed vs vehicle category

Analysis: From the plot, we can see that the number of claims made doesn’t vary with the vehicle category.

ANALYSIS OF FACTOR AND NUMERIC

Profit among Gender

##   ins_rm_ex_IDV_pro$Gender ins_rm_ex_IDV_pro$profit
## 1                   Female                 82.48491
## 2                     Male                 82.55055
## # A tibble: 2 x 3
##   Gender      avg      std
##   <fctr>    <dbl>    <dbl>
## 1 Female 82.48491 1.638886
## 2   Male 82.55055 1.599992

Analysis: From the box plot we see that there is not much difference in profit by gender, as the means of the two groups are nearly equal. The ANOVA test gives a p-value greater than 0.05, which means that at the 5% significance level we cannot reject the null hypothesis that there is no relation between profit and gender. So we find no evidence that profit is affected by gender.

Doing anova testing to verify relation.

##               Df Sum Sq Mean Sq F value Pr(>F)
## Gender         1      5   5.244   2.022  0.155
## Residuals   6233  16165   2.593

With a p-value of 0.155, we fail to reject the null hypothesis; hence there is no evidence of a relation between profit and gender.
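The F value and p-value in the ANOVA table can be recomputed from the mean squares and degrees of freedom it reports. A minimal sketch using the rounded figures from the table above:

```r
# Figures taken from the one-way ANOVA table for Gender
ms_gender <- 5.244   # mean square between groups
ms_resid  <- 2.593   # mean square within groups (residuals)
df_gender <- 1
df_resid  <- 6233

# F statistic: ratio of between-group to within-group mean squares
f_value <- ms_gender / ms_resid

# Upper-tail probability of the F distribution
p_value <- pf(f_value, df_gender, df_resid, lower.tail = FALSE)

print(round(c(F = f_value, p = p_value), 3))
```

Because the inputs are rounded, the result agrees with the table only to about three decimal places.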

Profit among Age

##   ins_rm_ex_IDV_pro$Age ins_rm_ex_IDV_pro$profit
## 1                 18-24                 82.56411
## 2                 25-34                 82.59072
## 3                 35-44                 82.46932
## 4                 45-54                 82.55855
## 5                 55-64                 82.55965
## 6                   65+                 82.43581
## # A tibble: 6 x 3
##      Age      avg      std
##   <fctr>    <dbl>    <dbl>
## 1  18-24 82.56411 1.655323
## 2  25-34 82.59072 1.600675
## 3  35-44 82.46932 1.657083
## 4  45-54 82.55855 1.546252
## 5  55-64 82.55965 1.613356
## 6    65+ 82.43581 1.547866

Analysis: From the box plot we see that there is not much difference in profit by age group, as the means of all groups are nearly equal. The ANOVA test gives a p-value greater than 0.05, which means that at the 5% significance level we cannot reject the null hypothesis that there is no relation between profit and Age Group. So we find no evidence that profit is affected by Age Group.

Doing anova testing to verify relation.

##               Df Sum Sq Mean Sq F value Pr(>F)
## Age            5     17   3.427   1.322  0.252
## Residuals   6229  16153   2.593

With a p-value of 0.252, we fail to reject the null hypothesis; hence there is no evidence of a relation between profit and age group.

Profit among Zone

##   ins_rm_ex_IDV_pro$Zone ins_rm_ex_IDV_pro$profit
## 1                   East                 82.57871
## 2                  North                 82.51103
## 3                  South                 82.57034
## 4                   West                 82.51009
## # A tibble: 4 x 3
##     Zone      avg      std
##   <fctr>    <dbl>    <dbl>
## 1   East 82.57871 1.557689
## 2  North 82.51103 1.620769
## 3  South 82.57034 1.656531
## 4   West 82.51009 1.525994

Doing anova testing to verify relation.

##               Df Sum Sq Mean Sq F value Pr(>F)
## Zone           3      6   1.872   0.722  0.539
## Residuals   6231  16165   2.594

With a p-value of 0.539, we fail to reject the null hypothesis; hence there is no evidence of a relation between profit and zone.

Profit among Vehicle Category

##   ins_rm_ex_IDV_pro$Vehicle_Cat ins_rm_ex_IDV_pro$profit
## 1                CC-large sized                 84.35655
## 2               CC-medium sized                 82.91999
## 3                CC-small sized                 81.88915
## # A tibble: 3 x 3
##       Vehicle_Cat      avg      std
##            <fctr>    <dbl>    <dbl>
## 1  CC-large sized 84.35655 1.741308
## 2 CC-medium sized 82.91999 1.634476
## 3  CC-small sized 81.88915 1.093841

Analysis: From the box plot and the group means we see that profit differs by vehicle category: large-sized vehicles average 84.36, medium-sized 82.92 and small-sized 81.89. The ANOVA test gives a p-value well below 0.05, so at the 5% significance level we reject the null hypothesis that there is no relation between profit and vehicle category. The Tukey comparisons show that all three pairwise differences are significant.

Doing anova testing to verify relation.

##               Df Sum Sq Mean Sq F value              Pr(>F)    
## Vehicle_Cat    2   4272  2136.0    1119 <0.0000000000000002 ***
## Residuals   6232  11898     1.9                                
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = profit ~ Vehicle_Cat, data = ins_rm_ex_IDV_pro)
## 
## $Vehicle_Cat
##                                     diff       lwr        upr p adj
## CC-medium sized-CC-large sized -1.436563 -1.573586 -1.2995402     0
## CC-small sized-CC-large sized  -2.467401 -2.596851 -2.3379506     0
## CC-small sized-CC-medium sized -1.030837 -1.121245 -0.9404296     0

There is a relation between profit and vehicle category, with large-sized vehicles being the most profitable for the company.

Note:
• We removed the outliers (1279 + 188) but also compared our analysis against the original data for all factors.
• There is no major change in the trend of our analysis, so we consider it safe to remove the outliers.

Regression Modelling

Trying a few transformations for a better distribution of profit:

No effect of taking log;
No effect of taking sqrt;
We will continue our analysis with the untransformed profit;

Model #1

Creating a model using only the numeric variable, i.e. IDV

## 
## Call:
## lm(formula = profit ~ IDV, data = ins_rm_ex_IDV_pro)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.5617 -0.2535 -0.0843  0.2650  6.1736 
## 
## Coefficients:
##                   Estimate     Std. Error t value            Pr(>|t|)    
## (Intercept) 80.43711723966  0.03059346105 2629.23 <0.0000000000000002 ***
## IDV          0.00000577385  0.00000007417   77.84 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.147 on 6233 degrees of freedom
## Multiple R-squared:  0.4929, Adjusted R-squared:  0.4929 
## F-statistic:  6060 on 1 and 6233 DF,  p-value: < 0.00000000000000022

Model #2

Creating a model using all the variables

## 
## Call:
## lm(formula = profit ~ IDV + Gender + Vehicle_Cat + Age + Zone + 
##     Year, data = ins_rm_ex_IDV_pro)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.6810 -0.2087  0.0409  0.3057  6.0525 
## 
## Coefficients:
##                                 Estimate    Std. Error t value
## (Intercept)                80.7162062784  0.1190821358 677.820
## IDV                         0.0000065859  0.0000001234  53.353
## GenderMale                  0.0842517237  0.0323590474   2.604
## Vehicle_CatCC-medium sized  0.0364740376  0.0548477614   0.665
## Vehicle_CatCC-small sized   0.1876735984  0.0670843519   2.798
## Age25-34                    0.0262043764  0.0489872183   0.535
## Age35-44                   -0.0768689473  0.0458444205  -1.677
## Age45-54                    0.0329628583  0.0476694745   0.691
## Age55-64                   -0.0425918737  0.0575798498  -0.740
## Age65+                     -0.0267083154  0.0649921829  -0.411
## ZoneNorth                  -0.0909098648  0.0441521145  -2.059
## ZoneSouth                  -0.0342999740  0.0495941442  -0.692
## ZoneWest                   -0.1267021763  0.0579187750  -2.188
## Year2006                   -0.0975678394  0.0872213229  -1.119
## Year2007                   -0.3580491535  0.0848796624  -4.218
## Year2008                   -0.6176543067  0.0847941500  -7.284
## Year2009                   -0.8296783768  0.0841800387  -9.856
## Year2010                   -0.8760355204  0.0826890728 -10.594
## Year2011                   -0.8467044504  0.0848576902  -9.978
##                                        Pr(>|t|)    
## (Intercept)                < 0.0000000000000002 ***
## IDV                        < 0.0000000000000002 ***
## GenderMale                              0.00925 ** 
## Vehicle_CatCC-medium sized              0.50607    
## Vehicle_CatCC-small sized               0.00516 ** 
## Age25-34                                0.59272    
## Age35-44                                0.09364 .  
## Age45-54                                0.48928    
## Age55-64                                0.45951    
## Age65+                                  0.68113    
## ZoneNorth                               0.03953 *  
## ZoneSouth                               0.48921    
## ZoneWest                                0.02874 *  
## Year2006                                0.26334    
## Year2007                      0.000024963290449 ***
## Year2008                      0.000000000000364 ***
## Year2009                   < 0.0000000000000002 ***
## Year2010                   < 0.0000000000000002 ***
## Year2011                   < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.119 on 6216 degrees of freedom
## Multiple R-squared:  0.5183, Adjusted R-squared:  0.5169 
## F-statistic: 371.6 on 18 and 6216 DF,  p-value: < 0.00000000000000022

Checking for multicollinearity using VIF (Variance Inflation Factor)

vif(mod.2) 
##                 GVIF Df GVIF^(1/(2*Df))
## IDV         2.907495  1        1.705138
## Gender      1.017047  1        1.008488
## Vehicle_Cat 2.448380  2        1.250892
## Age         1.011660  5        1.001160
## Zone        1.008092  3        1.001344
## Year        1.521716  6        1.035606
sqrt(vif(mod.2)) > 2 
##              GVIF    Df GVIF^(1/(2*Df))
## IDV         FALSE FALSE           FALSE
## Gender      FALSE FALSE           FALSE
## Vehicle_Cat FALSE FALSE           FALSE
## Age         FALSE  TRUE           FALSE
## Zone        FALSE FALSE           FALSE
## Year        FALSE  TRUE           FALSE

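If any variable is flagged TRUE, we would need to drop it. Here the only TRUE values fall in the Df column for Age and Year, because sqrt(5) and sqrt(6) exceed 2; Df is the degrees of freedom of each term, not an inflation factor, and no GVIF^(1/(2*Df)) value exceeds 2. Age and Year are nevertheless dropped in the next model.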
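For terms with more than one degree of freedom, the column usually compared against a threshold is GVIF^(1/(2*Df)), which plays the role of sqrt(VIF). A minimal sketch that rebuilds the reported table by hand (so that it runs without the model object) and checks only that column; the threshold of 2 mirrors the rule used above:

```r
# GVIF table as reported above, rebuilt by hand for illustration
gvif <- matrix(c(2.907495, 1, 1.705138,
                 1.017047, 1, 1.008488,
                 2.448380, 2, 1.250892,
                 1.011660, 5, 1.001160,
                 1.008092, 3, 1.001344,
                 1.521716, 6, 1.035606),
               ncol = 3, byrow = TRUE,
               dimnames = list(c("IDV", "Gender", "Vehicle_Cat", "Age", "Zone", "Year"),
                               c("GVIF", "Df", "GVIF^(1/(2*Df))")))

# The adjusted column is GVIF raised to 1 / (2 * Df)
recomputed <- gvif[, "GVIF"]^(1 / (2 * gvif[, "Df"]))

# Flag terms only on the adjusted column, never on Df
adjusted <- gvif[, "GVIF^(1/(2*Df))"]
flagged  <- names(adjusted)[adjusted > 2]

print(round(recomputed, 6))
print(flagged)
```

Under this reading no term is flagged, since the largest adjusted value (IDV, about 1.71) is below 2.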

Model #3

Creating a model by dropping Age and Year

## 
## Call:
## lm(formula = profit ~ IDV + Gender + Vehicle_Cat + Zone, data = ins_rm_ex_IDV_pro)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.4266 -0.2533 -0.0762  0.2588  6.1029 
## 
## Coefficients:
##                                 Estimate    Std. Error t value
## (Intercept)                80.7526245520  0.0911674122 885.762
## IDV                         0.0000055037  0.0000001033  53.293
## GenderMale                  0.0706957681  0.0328415068   2.153
## Vehicle_CatCC-medium sized -0.2070672376  0.0536704401  -3.858
## Vehicle_CatCC-small sized  -0.2571246921  0.0617809097  -4.162
## ZoneNorth                  -0.0816975666  0.0451195235  -1.811
## ZoneSouth                  -0.0160514480  0.0506483566  -0.317
## ZoneWest                   -0.1178341334  0.0591955365  -1.991
##                                        Pr(>|t|)    
## (Intercept)                < 0.0000000000000002 ***
## IDV                        < 0.0000000000000002 ***
## GenderMale                             0.031386 *  
## Vehicle_CatCC-medium sized             0.000115 ***
## Vehicle_CatCC-small sized              0.000032 ***
## ZoneNorth                              0.070237 .  
## ZoneSouth                              0.751315    
## ZoneWest                               0.046569 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.145 on 6227 degrees of freedom
## Multiple R-squared:  0.4954, Adjusted R-squared:  0.4948 
## F-statistic: 873.2 on 7 and 6227 DF,  p-value: < 0.00000000000000022

Adjusted R-squared: 49.48%, which is back close to the first model. Zone contributes little to the model, so it is dropped next.

Model #4

Creating a model with only the variables that showed a relation in our testing, plus Gender

## 
## Call:
## lm(formula = profit ~ IDV + Vehicle_Cat + Gender, data = ins_rm_ex_IDV_pro)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.4422 -0.2508 -0.0791  0.2593  6.1466 
## 
## Coefficients:
##                                 Estimate    Std. Error t value
## (Intercept)                80.6907726278  0.0827948160 974.587
## IDV                         0.0000055018  0.0000001033  53.258
## Vehicle_CatCC-medium sized -0.2054602269  0.0536739749  -3.828
## Vehicle_CatCC-small sized  -0.2565468917  0.0618012713  -4.151
## GenderMale                  0.0725237512  0.0328280543   2.209
##                                        Pr(>|t|)    
## (Intercept)                < 0.0000000000000002 ***
## IDV                        < 0.0000000000000002 ***
## Vehicle_CatCC-medium sized              0.00013 ***
## Vehicle_CatCC-small sized             0.0000335 ***
## GenderMale                              0.02720 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.145 on 6230 degrees of freedom
## Multiple R-squared:  0.4948, Adjusted R-squared:  0.4945 
## F-statistic:  1525 on 4 and 6230 DF,  p-value: < 0.00000000000000022

Adjusted R-squared: 49.45%; still no improvement.
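The adjusted R-squared values being compared penalize the multiple R-squared for the number of estimated coefficients. A short sketch of that relationship, using the figures from the Model #2 and Model #4 summaries above; the sample size of 6235 is inferred from the residual degrees of freedom (6216 + 18 + 1 and 6230 + 4 + 1), and the helper name `adjusted_r2` is hypothetical.

```r
# Adjusted R-squared from multiple R-squared, sample size n and
# number of predictors p (excluding the intercept)
adjusted_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

n_obs <- 6235  # inferred from residual df + predictors + 1

adj_m2 <- adjusted_r2(0.5183, n_obs, 18)  # Model #2: all variables
adj_m4 <- adjusted_r2(0.4948, n_obs, 4)   # Model #4: IDV, Vehicle_Cat, Gender

print(round(c(model2 = adj_m2, model4 = adj_m4), 4))
```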

Regression diagnostics on Model #4

## integer(0)

##        1031        3494         532        4025        2029        3933 
## 0.003376116 0.003376724 0.003386327 0.003600780 0.003691551 0.003846753
## 
## Call:
## lm(formula = profit ~ IDV + Vehicle_Cat + Gender, data = ins_rm_ex_IDV_pro_s)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.4426 -0.2508 -0.0790  0.2593  6.1471 
## 
## Coefficients:
##                                 Estimate    Std. Error t value
## (Intercept)                80.6902889175  0.0828297571 974.170
## IDV                         0.0000055023  0.0000001034  53.238
## Vehicle_CatCC-medium sized -0.2049044771  0.0537076679  -3.815
## Vehicle_CatCC-small sized  -0.2563707587  0.0618365396  -4.146
## GenderMale                  0.0723120275  0.0328399556   2.202
##                                        Pr(>|t|)    
## (Intercept)                < 0.0000000000000002 ***
## IDV                        < 0.0000000000000002 ***
## Vehicle_CatCC-medium sized             0.000137 ***
## Vehicle_CatCC-small sized             0.0000343 ***
## GenderMale                             0.027705 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.145 on 6226 degrees of freedom
## Multiple R-squared:  0.4947, Adjusted R-squared:  0.4944 
## F-statistic:  1524 on 4 and 6226 DF,  p-value: < 0.00000000000000022
## 3471  661 4305 3127 3809 4503 3795 4296 7554 4613 
##    0    0    0    0    0    0    0    0    0    0

Model #5

Creating a model after removing the outlier data identified from Model #4

## 
## Call:
## lm(formula = profit ~ IDV + Gender + Vehicle_Cat, data = ins_rm_ex_IDV_pro_r)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.4417 -0.2509 -0.0791  0.2600  6.1466 
## 
## Coefficients:
##                                 Estimate    Std. Error t value
## (Intercept)                80.6914262928  0.0829007427 973.350
## IDV                         0.0000055003  0.0000001034  53.171
## GenderMale                  0.0725144261  0.0328781918   2.206
## Vehicle_CatCC-medium sized -0.2054117718  0.0537545919  -3.821
## Vehicle_CatCC-small sized  -0.2569204624  0.0619026004  -4.150
##                                        Pr(>|t|)    
## (Intercept)                < 0.0000000000000002 ***
## IDV                        < 0.0000000000000002 ***
## GenderMale                             0.027452 *  
## Vehicle_CatCC-medium sized             0.000134 ***
## Vehicle_CatCC-small sized             0.0000336 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.146 on 6221 degrees of freedom
## Multiple R-squared:  0.4943, Adjusted R-squared:  0.494 
## F-statistic:  1520 on 4 and 6221 DF,  p-value: < 0.00000000000000022

Adjusted R-squared: 49.4%; no improvement over Model #4. We therefore go forward with Model #4, which explains at best 49.45% of the variance in profit.

## `geom_smooth()` using method = 'gam'

The plot above shows that IDV is linearly related to profit. The relation is most linear for large-sized vehicles; for medium-sized and small-sized vehicles it becomes linear after a certain point.

The model has an _adjusted R-squared value of 49.45%._   
o A one-unit increase in IDV increases the profit by 0.0000055.  
o Male drivers bring 0.0725 more profit than female drivers.  
o Medium-sized vehicles bring 0.2055 less profit than large-sized vehicles.  
o Small-sized vehicles bring 0.2565 less profit than large-sized vehicles.  
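To make the interpretation concrete, the fitted equation can be evaluated by hand. The sketch below uses the coefficient estimates from the Model #4 summary above; the function name `predict_profit` and the example inputs are illustrative, and with the fitted model object `predict()` would normally be used instead.

```r
# Coefficient estimates copied from the Model #4 summary
coef_m4 <- c(intercept = 80.6907726278,
             idv       = 0.0000055018,
             medium    = -0.2054602269,
             small     = -0.2565468917,
             male      = 0.0725237512)

# Hypothetical helper: predicted profit for one policy.
# Large-sized vehicles and female drivers are the baseline levels.
predict_profit <- function(idv, vehicle_cat, gender) {
  coef_m4[["intercept"]] +
    coef_m4[["idv"]]    * idv +
    coef_m4[["medium"]] * (vehicle_cat == "CC-medium sized") +
    coef_m4[["small"]]  * (vehicle_cat == "CC-small sized") +
    coef_m4[["male"]]   * (gender == "Male")
}

# Same car (IDV 650119, large-sized), female vs male driver
p_female <- predict_profit(650119, "CC-large sized", "Female")
p_male   <- predict_profit(650119, "CC-large sized", "Male")
print(round(c(female = p_female, male = p_male), 3))
```

The two predictions differ by exactly the GenderMale coefficient, which is what the bullet on gender expresses.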

Limitations

There are many other factors, such as the experience of the driver, the driver’s state of mind and the severity of the accident, which may further affect profitability and the model’s R-squared value.

Conclusion

As per our analysis based on the available dataset,

Profit is not dependent on age & gender of the driver.
Profit is not dependent on any geographical zone.
Profit of the company increases with the increase in cubic capacity of the vehicle.

We analyzed the dataset to identify the factors impacting the revenue and profit generation of the company. As per our analysis, factors such as the age of the driver, the gender of the driver and the geographical zone to which the driver belongs do not affect the profit of the company. There might be several reasons for these findings which we cannot explore due to limitations imposed by the data.
The personal traits of the driver and the traffic in a specific region affect the probability of accidents, which in turn impacts the revenue. Experience, vision, state of mind while driving, alcohol consumption and medication taken while driving might influence the probability of accidents.
Auto insurance companies in India do not consider the gender and age of the driver when calculating insurance premiums. Instead, the age of the vehicle and the history of accidents are used to set the premiums. We recommend that the company focus on insuring vehicles with higher IDV.